Improved CHAID algorithm for document structure modelling

Authors

  • Abdel Belaïd
  • T. Moinel
  • Yves Rangoni
Abstract

This paper proposes a technique for the logical labelling of document images. It makes use of a decision-tree based approach to learn and then recognise the logical elements of a page. A state-of-the-art OCR gives the physical features needed by the system. Each block of text is extracted during the layout analysis and raw physical features are collected and stored in the ALTO format. The data-mining method employed here is the “Improved CHi-squared Automatic Interaction Detection” (I-CHAID). The contribution of this work is the insertion of logical rules extracted from the logical layout knowledge to support the decision tree. Two setups have been tested; the first uses one tree per logical element, the second one uses a single tree for all the logical elements we want to recognise. The main system, implemented in Java, coordinates the third-party tools (Omnipage for the OCR part, and SIPINA for the I-CHAID algorithm) using XML and XSL transforms. It was tested on around 1000 documents belonging to the ICPR’04 and ICPR’08 conference proceedings, representing about 16,000 blocks. The final error rate for determining the logical labels (among 9 different ones) is less than 6%.
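The abstract names CHAID-style tree induction but does not show the split criterion, so the following is only a minimal sketch of the core idea: at each node, pick the categorical feature whose contingency table with the logical label gives the largest Pearson chi-squared statistic. It is not the I-CHAID implementation from SIPINA and not the authors' code. The function names (`chi_squared`, `best_split_feature`), the features (`font_size`, `alignment`), the labels, and the toy blocks are all hypothetical. Full CHAID also merges similar categories and compares Bonferroni-adjusted p-values rather than raw statistics; both steps are omitted here.

```python
from collections import Counter


def chi_squared(feature_values, labels):
    """Pearson chi-squared statistic of the feature-by-label contingency table."""
    n = len(labels)
    observed = Counter(zip(feature_values, labels))
    row_totals = Counter(feature_values)
    col_totals = Counter(labels)
    stat = 0.0
    for value, row_total in row_totals.items():
        for label, col_total in col_totals.items():
            # Expected count under independence of feature and label.
            expected = row_total * col_total / n
            stat += (observed[(value, label)] - expected) ** 2 / expected
    return stat


def best_split_feature(blocks, labels, feature_names):
    """Return the feature with the largest statistic, plus all scores.

    Simplification (assumption): raw statistics are compared directly,
    which is only fair when the features have the same number of categories.
    """
    scores = {
        name: chi_squared([block[name] for block in blocks], labels)
        for name in feature_names
    }
    return max(scores, key=scores.get), scores


# Hypothetical text blocks with discretised physical features.
blocks = [
    {"font_size": "large", "alignment": "centre"},
    {"font_size": "large", "alignment": "left"},
    {"font_size": "small", "alignment": "left"},
    {"font_size": "small", "alignment": "centre"},
]
labels = ["title", "title", "paragraph", "paragraph"]

best, scores = best_split_feature(blocks, labels, ["font_size", "alignment"])
print(best, scores)
```

In this toy data `font_size` separates the two labels perfectly while `alignment` is independent of them, so the sketch selects `font_size` as the first split.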


Similar resources

A Hybrid DEA Based CHAID and Imperialist Competitive Algorithm for Stock Selection

In this paper, the investment portfolio is formed based on the data mining algorithm of CHAID on the basis of the risk status criteria. In the next step, the second investment portfolio is created based on the decision rules extracted by the DEA-BCC model. The final portfolio is created through a two-objective mathematical programming model based on the Imperialist Competitive algorithm.


Evaluation of Data Mining Algorithms for Detection of Liver Disease

Background and Aim: The liver, as one of the largest internal organs in the body, is responsible for many vital functions including purifying blood, regulating the body's hormones, preserving glucose, and the body. Therefore, disruptions in these functions can sometimes be irreparable. Early prediction of these diseases will help their early and effective treatm...


Stock price prediction using the CHAID rule-based algorithm and particle swarm optimization (PSO)

Stock prices in each industry are one of the major issues in the stock market. Given the increasing number of shareholders in the stock market and their attention to the price of different stocks in transactions, the prediction of the stock price trend has become significant. Many people use the share price movement process when comparing different stocks while investing, and also want to pred...


Proposed Method for Predicting COVID-19 Severity in Chronic Kidney Disease Patients Based on Ant Colony Algorithm and CHAID

Background and Objective: The COVID-19 pandemic is a phenomenon that has infected and killed many people worldwide. Underlying diseases such as diabetes mellitus, heart failure, and chronic kidney disease (CKD) can affect the severity of COVID-19 and aggravate patients' condition. This study aimed to predict the severity of the COVID-19 disease in CKD patients by combining feature selection and...


Inducing Fuzzy Decision Trees in Non-Deterministic Domains using CHAID

Most decision tree induction methods used for extracting knowledge in classification problems are unable to deal with uncertainties embedded within the data, associated with human thinking and perception. This paper describes the development of a novel tree induction algorithm which improves the classification accuracy of decision tree induction in non-deterministic domains. The research involv...




Publication date: 2010